library(climatic4economist)2: Extract spatial data based on spatial polygons
1 Introduction
This tutorial show how too extract spatial control variables based on surveys administrative divisions. The survey refers to Ethiopia 2019 and comes from the The World Bank Living Standards Measurement Study (LSMS). The spatial variables are nighttime light, agroecological zones, Urban-Rural Catchment Area, elevation, and climatic parameters. More information about these datasets are in the Appendix.
2 Code
2.1 Set Up
We start by setting up the stage for our analysis. First, we load the necessary packages. We load only climatic4economist package that contains several functions meant to extract and merge spatial variables with surveys. During the tutorial we will use other packages but instead of loading all the package at the begging we will call specific function each time.
In the setup, we also want to create the paths to the various data sources and load the necessary functions for extraction. Note .. means one step back to the folder directory, i.e. one folder back.
Note that how to set up the paths depends on your folder organization but there are overall two approaches:
- you can use the
R project, by opening the project directly you don’t need to set up the path to the project. Automatically the project figures out on its own where it is located in the computer and set that path as working folder. - you can manually set the working folder with the function
setwd().
# path to data folder
path_to_data <- file.path("..",
"..", "data")
# survey and administrative division
path_to_survey <- file.path(path_to_data, "survey", "LSMS", "LSMS_ETH19.dta")
path_to_adm_div <- file.path(path_to_data, "adm_div", "geoBoundaries")
# weather variables
path_to_pre <- file.path(path_to_data, "weather", "ERA5_Land", "AFR", "monthly",
"afr_month_50_25_tpr.nc")
path_to_tmp <- file.path(path_to_data, "weather", "ERA5_Land", "AFR", "monthly",
"afr_month_50_25_tmp.nc")
# control variables
path_to_elevation <- file.path(path_to_data, "spatial", "elevation", "GloFAS",
"elevation_glofas_v4_0.nc")
path_to_urca <- file.path(path_to_data, "spatial", "URCA",
"URCA.tif")
path_to_pop <- file.path(path_to_data, "spatial", "population", "WorldPop",
"uncontraint_1km_global", "ppp_2019_1km_Aggregated.tif")
path_to_nightlight <- file.path(path_to_data, "spatial", "nighttime_light",
"VIIRS", "VNL_v21_npp_2019_global_vcmslcfg_c202205302300.average_masked.dat.tif")
path_to_aez <- file.path(path_to_data, "spatial", "AgroEcological", "AEZ",
"GAEZv5", "GAEZ-V5.AEZ33-10km.tif")
# to result folder
path_to_result <- file.path(path_to_data, "result")- 1
- concatenate the string to make a path
- 2
-
..means one folder back
2.2 Read the Data
2.2.1 Survey Data
We start by reading the surveys data. The survey is stored as dta file, so we use the haven::read_dta() function to read it.
We only need the hhid, the survey coordinates, and the interview dates. We use dplyr::select() to choose these variables. This passage is optional and we bring with us all the variables, but we won’t use them.
Then we create/modify some variables with the function dplyr::mutate(). We transform the the variable interview_date from string into data, and we get the year of the median value of the date of interviews. This passage is important as it allows us to define the most appropriate year to select for the spatial variables.
srvy <- haven::read_dta(path_to_survey) |>
dplyr::select(survey_year, hhid, country, lat, lon, interview_date) |>
dplyr::mutate(
interview_date = clock::date_parse(interview_date,
format = "%Y-%m-%d"),
survey_year = clock::get_year(median(interview_date)),
.before = hhid)- 1
- read dta type data
- 2
- select relevant variables
- 3
- transform string into date type
- 4
- specify format type
- 5
- find the median year of the interviews
2.2.2 Spatial Data
Finally, we load the spatial data. This data typically comes in the form of raster data. A raster represents a two-dimensional image as a rectangular matrix or grid of pixels. These are spatial rasters because they are georeferenced, meaning each pixel (or “cell” in GIS terms) represents a square region of geographic space. The value of each cell reflects a measurable property (either qualitative or quantitative) of that region.
To spatial data is usually stored as tif file or nc. We can read both of them them with the function terra::rast().
When we print the raster, we obtain several key details. The dimension tells us how many cells the raster consists of and the number of layers, each layer corresponds to a particular months for which the observations were made. We also get the spatial resolution, which defines the size of each square region in geographic space, and the coordinate reference system (CRS), i.e. EPSG:4326.
When working with multiple spatial data, you must ensure that they have the same coordinate reference system (CRS). This is important because in this way all the data can “spatially” talk to each other.
pop <- terra::rast(path_to_pop) |>
setNames("pop")
pop
nightlight <- terra::rast(path_to_nightlight) |>
setNames("nightlight")
nightlight
elevation <- terra::rast(path_to_elevation)
elevation
urca <- terra::rast(path_to_urca)
urca
aez <- terra::rast(path_to_aez) |>
setNames("aez")
aez- 1
- read raster type data
- 2
- change the name of the layer
class : SpatRaster
dimensions : 18720, 43200, 1 (nrow, ncol, nlyr)
resolution : 0.008333333, 0.008333333 (x, y)
extent : -180.0012, 179.9987, -72.00042, 83.99958 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : ppp_2019_1km_Aggregated.tif
name : pop
class : SpatRaster
dimensions : 33601, 86401, 1 (nrow, ncol, nlyr)
resolution : 0.004166667, 0.004166667 (x, y)
extent : -180.0021, 180.0021, -65.00208, 75.00208 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : VNL_v21_npp_2019_global_vcmslcfg_c202205302300.average_masked.dat.tif
name : nightlight
class : SpatRaster
dimensions : 3000, 7200, 1 (nrow, ncol, nlyr)
resolution : 0.05, 0.05 (x, y)
extent : -180, 180, -60, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source : elevation_glofas_v4_0.nc
varname : elevation (Height above sea level)
name : elevation
unit : m
class : SpatRaster
dimensions : 17235, 43200, 1 (nrow, ncol, nlyr)
resolution : 0.008333333, 0.008333333 (x, y)
extent : -180, 180, -60, 83.625 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : URCA.tif
name : URCA
class : SpatRaster
dimensions : 2160, 4320, 1 (nrow, ncol, nlyr)
resolution : 0.08333333, 0.08333333 (x, y)
extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : GAEZ-V5.AEZ33-10km.tif
name : aez
Now we also read the weather observation. The same consideration about the coordinate reference system (CRS) is still valid. When we work with raster that have also observations over time, it is important to check how and where the time and date information is stored. Sometimes it is stored in the metadata and you can access it using terra::time(), other time it is already saved as the name of the layer and you can access it using names(). Sometimes, like in this case the date information is stored in the names but the format is based on second passed from 1970-01-01 00:00. To transform this observation into readable date we can use the function second_to_date().
Note that rasters can store time information in different ways, so it may not always be possible to retrieve dates in this manner. A common alternative is for dates to be embedded in the layer names, in which case we wouldn’t need to rename the layers.
pre <- terra::rast(path_to_pre)
pre
names(pre) <- terra::names(pre) |> second_to_date()
pre
tmp <- terra::rast(path_to_tmp)
names(tmp) <- terra::names(tmp) |> second_to_date()- 1
- transform the layers name with second into dates
class : SpatRaster
dimensions : 741, 811, 904 (nrow, ncol, nlyr)
resolution : 0.1, 0.1 (x, y)
extent : -26.05, 55.05, -36.05, 38.05 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source : afr_month_50_25_tpr.nc:tp
varname : tp (Total precipitation)
names : tp_va~52000, tp_va~73600, tp_va~54400, tp_va~76000, tp_va~84000, tp_va~05600, ...
unit : m, m, m, m, m, m, ...
class : SpatRaster
dimensions : 741, 811, 904 (nrow, ncol, nlyr)
resolution : 0.1, 0.1 (x, y)
extent : -26.05, 55.05, -36.05, 38.05 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84
source : afr_month_50_25_tpr.nc:tp
varname : tp (Total precipitation)
names : 1950-01-01, 1950-02-01, 1950-03-01, 1950-04-01, 1950-05-01, 1950-06-01, ...
unit : m, m, m, m, m, m, ...
2.2.3 Administrative Boundaries
We now move to read the administrative divisions. We use the function read_geoBoundaries() to do so. This function looks for spatial polygons for the iso and lvl provided provided.
As we have the coordinates, we don’t actually need the administrative divisions for the extraction. However, we will use it to reduce the coverage of the spatial variables and to make some plots.
The same consideration about the coordinate reference system (CRS) is still valid.
adm_div <- read_geoBoundaries(path_to_adm_div, iso = "ETH", lvl = 2)
adm_div class : SpatVector
geometry : polygons
dimensions : 74, 4 (geometries, attributes)
extent : 33.00224, 47.95925, 3.400365, 14.84602 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
names : ID_adm_div iso adm_div_1 adm_div_2
type : <chr> <chr> <chr> <chr>
values : 1 ETH Somali Afder
2 ETH Gambela Agnuak
3 ETH SNNPR Alaba
2.3 Georeference the Surveys
As we’ve mentioned, the spatial data is georeferenced, so we need to ensure the same for the survey data. We use the spatial coordinates to assign the administrative division to each household. We must ensure that we can later associate the correct weather data with the right household, we do this by creating an merging variable called ID.
This is handled by the prepare_coord() function, which requires the coordinates’ variable names as input.
Once we have the unique coordinates, we are ready to transform them into spatial points using the georef_coord() function. When performing this transformation, it’s crucial to set the correct CRS, which must match that of the weather data. The CRS is provided as an argument of the function, using the previously saved CRS from the weather data. Also the georef_coord() function requires the coordinates’ variable names as input. Usually, the WGS 84 CRS is the default coordinate references system for coordinates. In this case it matches the weather coordinate references system.
We can print the result to check the transformation. The new column, ID, is created by prepare_coord() and identifies each unique coordinate. This is used to merge the weather data with the household data.
srvy_coord <- prepare_coord(srvy,
lon_var = lon,
lat_var = lat)
srvy_coord# A tibble: 6,505 × 7
ID survey_year hhid country lat lon interview_date
<chr> <int> <chr> <chr> <dbl> <dbl> <date>
1 1 2019 051103088801903002 Ethiopia 3.61 39.0 2019-08-28
2 1 2019 051103088801903012 Ethiopia 3.61 39.0 2019-08-28
3 1 2019 051103088801903022 Ethiopia 3.61 39.0 2019-08-28
4 1 2019 051103088801903032 Ethiopia 3.61 39.0 2019-08-29
5 1 2019 051103088801903042 Ethiopia 3.61 39.0 2019-08-29
6 1 2019 051103088801903052 Ethiopia 3.61 39.0 2019-08-28
7 1 2019 051103088801903062 Ethiopia 3.61 39.0 2019-08-28
8 1 2019 051103088801903072 Ethiopia 3.61 39.0 2019-08-28
9 1 2019 051103088801903082 Ethiopia 3.61 39.0 2019-08-28
10 1 2019 051103088801903092 Ethiopia 3.61 39.0 2019-08-29
# ℹ 6,495 more rows
Once we have the unique coordinates, we are ready to transform them into spatial points using the georef_coord() function. When performing this transformation, it’s crucial to set the correct CRS, which must match that of the weather data. The function also the coordinates’ variable names as input.
srvy_geo <- georef_coord(srvy_coord,
geom = c("lon", "lat"),
crs = "EPSG:4326")
srvy_geo class : SpatVector
geometry : points
dimensions : 516, 1 (geometries, attributes)
extent : 33.43483, 47.30784, 3.609384, 14.47715 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
names : ID
type : <chr>
values : 1
2
3
2.4 Merge administrative division and survey
We want to associated the survey location to the administrative divisions. We do it by looking in which administrative division each survey location fall in. We save this information for later use.
srvy_adm_div <- get_poly_attr_for_point(srvy_geo, adm_div)
srvy_adm_div# A tibble: 516 × 7
ID lon lat ID_adm_div iso adm_div_1 adm_div_2
<chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 39.0 3.61 10 ETH Oromia Borena
2 2 41.8 4.01 37 ETH Somali Liben
3 3 41.9 4.44 1 ETH Somali Afder
4 4 41.5 4.73 37 ETH Somali Liben
5 5 36.0 4.74 56 ETH SNNPR South Omo
6 6 38.1 4.85 10 ETH Oromia Borena
7 7 37.4 4.96 10 ETH Oromia Borena
8 8 40.7 5.11 37 ETH Somali Liben
9 9 41.9 5.15 1 ETH Somali Afder
10 10 44.6 5.24 1 ETH Somali Afder
# ℹ 506 more rows
2.5 Crop the spatial variables
The spatial variables variables we have just load have a global coverage. It might be convenient to reduce the coverage to just the countries we are interested in. We can do this by using the crop_with_buffer() function and the administrative divisions. As the name suggest, this function allows to specify a buffer around the vector data to increase the spatial extent and crop a larger portion. This is useful as some survey coordinates are at the edge of the administrative borders or, in some rare cases, just outside the borders as consequence of the coordinates modification fro location anonymization. Further, to compute some spatial indicators in one cell we need the surrounding cell values and if we crop exactly at the borders those cell values at the edge won’t have the he surrounding cells.
The buffer argument of the function specifies the increase around the spatial extent. By default, it is in the same unit of measure of the data.
This is not a compulsory step but it reduce the memory burden and allows for more meaningful plotting.
pop_cntry <- crop_with_buffer(pop, adm_div, buffer = 1)
nghtlght_cntry <- crop_with_buffer(nightlight, adm_div, buffer = 1)
elevatn_cntry <- crop_with_buffer(elevation, adm_div, buffer = 1)
urca_cntry <- crop_with_buffer(urca, adm_div, buffer = 1)
aez_cntry <- crop_with_buffer(aez, adm_div, buffer = 1)
pre_cntry <- crop_with_buffer(pre, adm_div, buffer = 1)
tmp_cntry <- crop_with_buffer(tmp, adm_div, buffer = 1)2.6 Plot
A good practice when working with spatial data is to plot it. This is the best way to verify that everything is working as expected.
First, we plot the survey coordinates to ensure they are correctly located within the country and to examine their spatial distribution.
terra::plot(adm_div, col = "grey", main = "District of Ethiopia and Survey Coordinates")
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)- 1
- plot raster
- 2
- add survey locations
We confirm that the survey locations are within the country borders, which is great! We also observe that the spatial distribution of survey coordinates is neither random nor uniform; most are concentrated near the major cities and in the North.
Next, we plot the spatial variables to see how it overlaps with the spatial coordinates.
terra::plot(elevatn_cntry, main = "Elevation")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)- 1
- plot raster
- 2
- add administrative borders
- 3
- add survey locations
terra::plot(log(1+pop_cntry), main = "Log Population")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)terra::plot(urca_cntry, main = "URCA")
terra::lines(adm_div, col = "black", lwd = 2)
terra::points(srvy_geo, col = "red", alpha = 1, cex = 0.6)terra::plot(log(1+nghtlght_cntry), main = "Log Nighttime Light")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 1, cex = 0.6)terra::plot(tmp_cntry, "2024-03-01", col = terra::map.pal("water"),
main = "Monthly precipitation at 2024-03 and survey location")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "red", alpha = 0.5, cex = 0.5)terra::plot(tmp_cntry, "2024-03-01", col = terra::map.pal("ryb"),
main = "Monthly temperature at 2024-03 and survey location")
terra::lines(adm_div, col = "white", lwd = 1)
terra::points(srvy_geo, col = "black", alpha = 0.5, cex = 0.5)Once again, the survey coordinates align with the precipitation data, which is great! We can also observe the different spatial resolution, with precipitation having a lower one. The consequence is that some survey coordinates still fall within the same cell.
2.7 Modify the Spatial Variables
2.7.1 Compute Terrain Indicators
Now we compute some terrain indicators based on elevation. The terrain indicators are:
TRI (Terrain Ruggedness Index) is the mean of the absolute differences between the value of a cell and its 8 surrounding cells.
Slope is the average difference between the value of a cell and its 8 surrounding cells.
Roughness is the difference between the maximum and the minimum value of a cell and its 8 surrounding cells.
terrain_cntry <- terra::terrain(elevatn_cntry,
v = c("slope", "TRI", "roughness"),
neighbors = 8,
unit = "degrees")
nghtlght_cntry$ln_nightlight <- log(1 + nghtlght_cntry)- 1
- the terrain indicators
- 2
- how many neighboring cells, 8 (queen case) or 4 (rook case)
2.7.2 Weather Variable Transformation
The original unit of measure of the weather data is in meter for precipitation and Kelvin for temperature. These unit of measure are not very intuitive, therefore we change them into millimeter and Celsius respectively.
# From meter to millimeters
pre_cntry_mm <- pre_cntry*1000
# From Kelvin to Celsius
tmp_cntry_c <- tmp_cntry - 273.152.7.3 Compute the Surface Area of the Administrative Divisons
We now compute the surface are of each administrative division. We will use it for computing the population density. We use the function dplyr::mutate() to add the variable area_km and the function terra::expanse() to compute the surface area.
adm_div_area <- adm_div |>
dplyr::mutate(area_km = terra::expanse(adm_div, unit = "km"))2.8 Extraction
2.8.1 Spatial Variables
We extract the spatial data based on the administrative division using the extract_by_poly() function. This function requires the raster with the spatial data, the administrative division, and the aggregation function as input. The aggregation function, fn_agg, defines how the cell values that fall within an administrative division are combined into a single value. Note that by default all the cell values are weighted by the coverage area of the cell that fall within the division.
Contrary to the other spatial variable, for population we use adm_div_area to extract the values as we need the surface area to calculate the population density.
Looking at the result, we see first the ID_adm_div column, that identifies the unique administrative divisions. The second to fourth column are the additional information coming from the administrative division data. The last column contain the spatial observations aggregated at the administrative division.
pop_adm <- extract_by_poly(pop_cntry, adm_div_area, fn_agg = "sum")
pop_adm# A tibble: 74 × 6
ID_adm_div iso adm_div_1 adm_div_2 area_km pop
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 1 ETH Somali Afder 62044. 791202.
2 2 ETH Gambela Agnuak 23507. 250289.
3 3 ETH SNNPR Alaba 863. 326562.
4 4 ETH Oromia Arsi 20974. 3761471.
5 5 ETH Beneshangul Gumu Asosa 14397. 513373.
6 6 ETH Amhara Awi/Agew 8950. 1249906.
7 7 ETH Oromia Bale 54644. 2011606.
8 8 ETH SNNPR Basketo 419. 79602.
9 9 ETH SNNPR Bench Maji 19085. 921810.
10 10 ETH Oromia Borena 52437. 1406311.
# ℹ 64 more rows
nghtlght_adm <- extract_by_poly(nghtlght_cntry, adm_div, fn_agg = "mean")
elevation_adm <- extract_by_poly(elevatn_cntry, adm_div, fn_agg = "mean")
terrain_adm <- extract_by_poly(terrain_cntry, adm_div, fn_agg = "mean")
urca_adm <- extract_by_poly(urca_cntry, adm_div, fn_agg = "modal")
aez_adm <- extract_by_poly(aez_cntry, adm_div, fn_agg = "modal")
aez_adm# A tibble: 74 × 5
ID_adm_div iso adm_div_1 adm_div_2 aez
<chr> <chr> <chr> <chr> <dbl>
1 1 ETH Somali Afder 26
2 2 ETH Gambela Agnuak 2
3 3 ETH SNNPR Alaba 5
4 4 ETH Oromia Arsi 6
5 5 ETH Beneshangul Gumu Asosa 2
6 6 ETH Amhara Awi/Agew 5
7 7 ETH Oromia Bale 1
8 8 ETH SNNPR Basketo 26
9 9 ETH SNNPR Bench Maji 26
10 10 ETH Oromia Borena 1
# ℹ 64 more rows
2.8.2 Weather Variables
Fro weather data, we use a different function for extracting the data, namely extract_cell_by_poly(). Contrary to the function extract_by_poly(), this doesn’t aggregate the values within the polygons but extract each all the cell values within the division separately. This is important as we want to compute the long run climatic parameter for cell and only later aggregate them.
To extract each cells is more computationally and memory demanding, especially with large countries and long time series, but it increases precision as the aggregation, thus lost of information, is done at very last stage of the process.
Looking at the result, we see first the ID_adm_div column, that identifies the unique administrative divisions. The second and third column are the coordinates of the cells. The fourth is the amount of the cell that actually falls within the administrative division. The other columns contain the weather observations over time specific to each cell.
pre_cell <- extract_cell_by_poly(pre_cntry_mm, adm_div)
tmp_cell <- extract_cell_by_poly(tmp_cntry_c, adm_div)tmp_cell# A tibble: 11,845 × 908
ID_adm_div x_cell y_cell coverage_fraction X1950_01_01 X1950_02_01
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 41 6.7 0.0309 23.0 24.3
2 1 41.1 6.7 0.398 23.8 25.1
3 1 41.2 6.7 0.0640 23.9 25.2
4 1 40.9 6.6 0.0118 22.0 23.3
5 1 41 6.6 0.673 22.9 24.2
6 1 41.1 6.6 1 23.8 25.1
7 1 41.2 6.6 0.887 24.2 25.5
8 1 41.3 6.6 0.430 24.6 25.8
9 1 41.4 6.6 0.0915 24.6 25.8
10 1 41.8 6.6 0.103 26.9 28.0
# ℹ 11,835 more rows
# ℹ 902 more variables: X1950_03_01 <dbl>, X1950_04_01 <dbl>,
# X1950_05_01 <dbl>, X1950_06_01 <dbl>, X1950_07_01 <dbl>, X1950_08_01 <dbl>,
# X1950_09_01 <dbl>, X1950_10_01 <dbl>, X1950_11_01 <dbl>, X1950_12_01 <dbl>,
# X1951_01_01 <dbl>, X1951_02_01 <dbl>, X1951_03_01 <dbl>, X1951_04_01 <dbl>,
# X1951_05_01 <dbl>, X1951_06_01 <dbl>, X1951_07_01 <dbl>, X1951_08_01 <dbl>,
# X1951_09_01 <dbl>, X1951_10_01 <dbl>, X1951_11_01 <dbl>, …
2.9 Cmpute Long Run Climatic Parameter
We want to describe the long run climatic condition in each locations. Rule of thumb is to use 30 years of weather observations to capture climatic features. Therefore, we select the 30 years before each survey.
Check the names with the date of observations and how it has changed since before.
pre_cell_30yrs <- select_by_dates(pre_cell, from = "1989", to = "2019" )
tmp_cell_30yrs <- select_by_dates(tmp_cell, from = "1989", to = "2019")
tmp_cell_30yrs# A tibble: 11,845 × 365
ID_adm_div x_cell y_cell coverage_fraction X1989_01_01 X1989_02_01
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 41 6.7 0.0309 24.3 24.2
2 1 41.1 6.7 0.398 25.0 24.9
3 1 41.2 6.7 0.0640 25.0 25.0
4 1 40.9 6.6 0.0118 23.5 23.3
5 1 41 6.6 0.673 24.1 24.1
6 1 41.1 6.6 1 24.9 24.9
7 1 41.2 6.6 0.887 25.3 25.3
8 1 41.3 6.6 0.430 25.5 25.4
9 1 41.4 6.6 0.0915 25.4 25.3
10 1 41.8 6.6 0.103 27.4 27.8
# ℹ 11,835 more rows
# ℹ 359 more variables: X1989_03_01 <dbl>, X1989_04_01 <dbl>,
# X1989_05_01 <dbl>, X1989_06_01 <dbl>, X1989_07_01 <dbl>, X1989_08_01 <dbl>,
# X1989_09_01 <dbl>, X1989_10_01 <dbl>, X1989_11_01 <dbl>, X1989_12_01 <dbl>,
# X1990_01_01 <dbl>, X1990_02_01 <dbl>, X1990_03_01 <dbl>, X1990_04_01 <dbl>,
# X1990_05_01 <dbl>, X1990_06_01 <dbl>, X1990_07_01 <dbl>, X1990_08_01 <dbl>,
# X1990_09_01 <dbl>, X1990_10_01 <dbl>, X1990_11_01 <dbl>, …
Now we can compute the long run climatic parameter. We calculate the mean, the standard deviation, and the coefficient of variation. We collect all the parameter in a separate object parameter. This object is a names list of functions and we construct it with this structure name = function, then the list() function puts them together. This passage is not compulsory but allows to perform the computation of multiple parameters in a tidy and efficient way. Otherwise we could have directly add them inside the calc_par().
The function calc_par() calculates the required parameters.
The results have a similar structure, with the first columns that identify the specific locations and the other the computed parameters. Note how we are still carrying on the coverage_fraction variable as we will need it for aggregating the climatic parameter at the administrative division.
parameter <- list(std = sd, avg = mean, coef_var = cv)
parameter$std
function (x, na.rm = FALSE)
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm))
<bytecode: 0x17e832da0>
<environment: namespace:stats>
$avg
function (x, ...)
UseMethod("mean")
<bytecode: 0x12cadd2f8>
<environment: namespace:base>
$coef_var
function(x, na_rm = TRUE) {
avg <- mean(x, na.rm = na_rm)
if (is.nan(avg)) return(NA_real_)
if (avg == 0) return(NA_real_) # Avoid division by zero
sd(x, na.rm = na_rm) / mean(x, na.rm = na_rm)
}
<bytecode: 0x17e834978>
<environment: namespace:climatic4economist>
pre_par_cell <- calc_par(pre_cell_30yrs, pars = parameter, prefix = "pre")
tmp_par_cell <- calc_par(tmp_cell_30yrs, pars = parameter, prefix = "tmp")
tmp_par_cell# A tibble: 11,845 × 7
ID_adm_div x_cell y_cell coverage_fraction tmp_std tmp_avg tmp_coef_var
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 40.9 5.7 0.249 1.33 26.8 0.0497
2 1 40.9 5.8 0.673 1.37 27.1 0.0505
3 1 40.9 5.9 0.725 1.37 26.6 0.0516
4 1 40.9 6 0.728 1.32 25.5 0.0518
5 1 40.9 6.1 0.811 1.28 24.8 0.0517
6 1 40.9 6.2 0.719 1.29 24.5 0.0525
7 1 40.9 6.3 0.780 1.25 24.1 0.0520
8 1 40.9 6.4 0.892 1.24 23.8 0.0521
9 1 40.9 6.5 0.509 1.22 23.4 0.0520
10 1 40.9 6.6 0.0118 1.18 22.9 0.0515
# ℹ 11,835 more rows
We have computed the climatic parameters for each cells but we still need to aggregate them at the administrative divisions. The function agg_to_adm_div() can do it for us, be aware the the function aggregate by using the weighted mean, where the weights are provided by the coverage_fraction variable.
In the results we lose the the information on the specific cells and we are left only with the administrative division id, ID_adm_div, and a single value of the climatic parameters for each locations.
pre_par_adm <- agg_to_adm_div(pre_par_cell , match_col = "pre")
tmp_par_adm <- agg_to_adm_div(tmp_par_cell, match_col = "tmp")
tmp_par_adm# A tibble: 74 × 4
ID_adm_div tmp_std tmp_avg tmp_coef_var
<chr> <dbl> <dbl> <dbl>
1 1 1.27 27.4 0.0464
2 10 1.59 22.5 0.0715
3 11 1.84 20.0 0.0920
4 12 1.65 20.9 0.0789
5 13 1.76 21.6 0.0808
6 14 1.17 26.5 0.0441
7 15 1.68 18.3 0.0919
8 16 1.18 21.3 0.0558
9 17 1.40 20.2 0.0697
10 18 2.11 20.4 0.103
# ℹ 64 more rows
2.10 Merge with Survey
Now that we have everything, we can combine all the extracted data and then merge them with the survey. We start by combining the data into a unique data set. To do so we start by create a list with the function list(), each element of the list is a different spatial variable and then we combine the elements of the list with the function purrr::reduce(). This last function require another function as input to drive the combination and we choose to use merge_by_common(), which merges two data by their common variable names.
Why not using directly merge_by_common()? Because the function works with just two datasets and we have eight different spatial datasets. We can cumulatively merge the datasets one by one or we can use the purrr::reduce().
Then we compute also the population density and the logarithmic transformation of the nighttime light. We use the dplyr::mutate() function to add these two new variables. We use the argument .after to specify where the position of the variable among the columns.
sptl_adm <- list(pop_adm,
nghtlght_adm,
terrain_adm,
elevation_adm,
urca_adm,
aez_adm,
pre_par_adm,
tmp_par_adm) |>
purrr::reduce(merge_by_common) |>
dplyr::mutate(pop_density = pop/area_km, .after = pop)
sptl_adm- 1
- combine the data into a list
- 2
- merge all the elements of the list
- 3
- create new variables
# A tibble: 74 × 21
ID_adm_div iso adm_div_1 adm_div_2 area_km pop pop_density nightlight
<chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 ETH Somali Afder 62044. 7.91e5 12.8 0.000139
2 2 ETH Gambela Agnuak 23507. 2.50e5 10.6 0.00118
3 3 ETH SNNPR Alaba 863. 3.27e5 378. 0.0387
4 4 ETH Oromia Arsi 20974. 3.76e6 179. 0.0108
5 5 ETH Beneshangul… Asosa 14397. 5.13e5 35.7 0.00389
6 6 ETH Amhara Awi/Agew 8950. 1.25e6 140. 0.00539
7 7 ETH Oromia Bale 54644. 2.01e6 36.8 0.00182
8 8 ETH SNNPR Basketo 419. 7.96e4 190. 0
9 9 ETH SNNPR Bench Ma… 19085. 9.22e5 48.3 0.00289
10 10 ETH Oromia Borena 52437. 1.41e6 26.8 0.00143
# ℹ 64 more rows
# ℹ 13 more variables: ln_nightlight <dbl>, slope <dbl>, TRI <dbl>,
# roughness <dbl>, elevation <dbl>, URCA <dbl>, aez <dbl>, pre_std <dbl>,
# pre_avg <dbl>, pre_coef_var <dbl>, tmp_std <dbl>, tmp_avg <dbl>,
# tmp_coef_var <dbl>
Now that we have all the control variables together, we can merge them with the surveys information. The function merge_by_common() will do it for us.
However, the surveys do not carry information on the administrative division we have used, therefore we need an additional step to provide this information. We calculated this link information before and save it as srvy_adm_div.
We first merge the link information with the spatial extracted variables, the output is then merge with the survey. Note that the pipe command |> assumes that the left side is the first argument in the function, as it is not the case for us we need to specify it with y = _, where y is the name of the argument and _ refer to the previous merge.
We can see that the result has all the information we retained from the surveys, the information about the administrative divisions, and the new extracted spatial variables.
srvy_sptl_adm <- merge_by_common(srvy_adm_div, sptl_adm) |>
merge_by_common(srvy_coord, y = _)
srvy_sptl_adm- 1
- merge adm info with spatial var
- 2
-
_refers to the output of the previous merge
# A tibble: 6,506 × 28
ID survey_year hhid country lat lon interview_date ID_adm_div iso
<chr> <int> <chr> <chr> <dbl> <dbl> <date> <chr> <chr>
1 1 2019 051103… Ethiop… 3.61 39.0 2019-08-28 10 ETH
2 1 2019 051103… Ethiop… 3.61 39.0 2019-08-28 10 ETH
3 1 2019 051103… Ethiop… 3.61 39.0 2019-08-28 10 ETH
4 1 2019 051103… Ethiop… 3.61 39.0 2019-08-29 10 ETH
5 1 2019 051103… Ethiop… 3.61 39.0 2019-08-29 10 ETH
6 1 2019 051103… Ethiop… 3.61 39.0 2019-08-28 10 ETH
7 1 2019 051103… Ethiop… 3.61 39.0 2019-08-28 10 ETH
8 1 2019 051103… Ethiop… 3.61 39.0 2019-08-28 10 ETH
9 1 2019 051103… Ethiop… 3.61 39.0 2019-08-28 10 ETH
10 1 2019 051103… Ethiop… 3.61 39.0 2019-08-29 10 ETH
# ℹ 6,496 more rows
# ℹ 19 more variables: adm_div_1 <chr>, adm_div_2 <chr>, area_km <dbl>,
# pop <dbl>, pop_density <dbl>, nightlight <dbl>, ln_nightlight <dbl>,
# slope <dbl>, TRI <dbl>, roughness <dbl>, elevation <dbl>, URCA <dbl>,
# aez <dbl>, pre_std <dbl>, pre_avg <dbl>, pre_coef_var <dbl>, tmp_std <dbl>,
# tmp_avg <dbl>, tmp_coef_var <dbl>
2.11 Write
Here we are at the end, let’s save the results. We want to save the result as dta so we will use the haven::write_dta() function.
haven::write_dta(srvy_sptl_adm,
file.path(path_to_result, "ETH_sp_adm.dta"))3 Take home messages
When working with multiple spatial data:
- remember to control the Coordinate Reference System of all dataset
- plot the data to check everything is going well
based on the typology of data we use different function
reading
- for
dtausehaven::read_dta() orandhaven::_write_dta() - for spatial vectors use
terra::vect()orread_adm_div()for administrative divisions in specific country and level. - for spatial raster use
terra::rast()
- for
extraction
- spatial points, use
extract_by_coord() - spatial polygons, use
extract_by_poly() - cells within polygons, use
extract_cell_by_poly()
- spatial points, use
When working with raster data
- check the unit of measure
- if it is a time series check also the date format
When working with spatial polygons, like administrative divisions valuate if you want to extract the values already aggregated or each cells separately
- for example the terrain indicators were computed for each cells and then we moved to the extraction
- for the climatic parameters, we extract each cells separately, we compute the parameters for each cells, and only later we aggregate them.
When georeferencing the survey location we take advantage that many interviews share the same locations. Hence, we extract the variables just for these unique locations. However, same of these unique locations may fall within the same value cells, so the actual information might be even lower.
4 Appendix
4.1 Want to know about the data?
4.1.1 Weather
Weather observation are obtained from ERA5-Land reanalysis dataset. H-TESSEL is the land surface model that is the basis of ERA5-Land. The data is a post-processed monthly-mean average of the original ERA5-Land dataset.
| Parameter | Value |
|---|---|
| spatial resolution | 0.1° x 0.1° lon lat |
| temporal resolution | month |
| time frame | Jan. 1950 - Dec. 2022 |
| unit of measure | meter or Kelvin |
Suggested citation:
- Muñoz Sabater, J. (2019): ERA5-Land monthly averaged data from 1950 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). DOI: 10.24381/cds.68d2bb30
It is possible to find additional information:
The data can be freely download from
- here.
4.1.1.0.1 Total precipitation
Accumulated liquid and frozen water, including rain and snow, that falls to the Earth’s surface. It is the sum of large-scale precipitation and convective precipitation. Precipitation variables do not include fog, dew or the precipitation that evaporates in the atmosphere before it lands at the surface of the Earth.
4.1.1.0.2 2 metre above ground temperature
Temperature of air at 2m above the surface of land, sea or in-land waters. 2m temperature is calculated by interpolating between the lowest model level and the Earth’s surface, taking account of the atmospheric conditions.
4.1.2 Spatial variables
4.1.2.1 Agro Ecological Zones
The Agro-ecological Zones classification (33 classes) provides a characterization of bio-physical resources relevant to agricultural production systems. AEZ definitions and map classes follow a rigorous methodology and an explicit set of principles. The inventory combines spatial layers of thermal and moisture regimes with broad categories of soil/terrain qualities. It also indicates locations of areas with irrigated soils and shows land with severely limiting bio-physical constraints including very cold and very dry (desert) areas as well as areas with very steep terrain or very poor soil/terrain conditions. The AEZ classification dataset is part of the GAEZ v5 Land and Water Resources theme and Agro-ecological Zones sub-theme. All results are derived from the Agro-ecological Zones (AEZ) modeling framework, developed collaboratively by the Food and Agriculture Organization (FAO) and the International Institute for Applied Systems Analysis (IIASA).
| Parameter | Value |
|---|---|
| spatial resolution | 10 km. |
| temporal resolution | 20 years |
| time frame | 2001–2020 |
| unit of measure | classification by climate/soil/terrain/LC (33 classes) |
Suggested citation:
- FAO & IIASA. 2025. Global Agro-ecological Zoning version 5 (GAEZ v5) Model Documentation. https://github.com/un-fao/gaezv5/wiki
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.2.2 Urban-Rural Catchment Area (URCA)
Urban–rural catchment areas showing the catchment areas around cities and towns of different sizes (the no data value is 128). Each rural pixel is assigned to one defined travel time category to one of seven urban agglomeration sizes.
| Parameter | Value |
|---|---|
| spatial resolution | 0.03° x 0.03° lon lat |
| temporal resolution | year |
| time frame | 2015 |
| unit of measure | travel time category to different urban hierarchy |
Suggested citation:
- Cattaneo, Andrea; Nelson, Andy; McMenomy, Theresa (2020). Urban-rural continuum. figshare. Dataset. https://doi.org/10.6084/m9.figshare.12579572.v4
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.2.3 Population
The units are number of people per pixel. The mapping approach is Random Forest-based dasymetric redistribution.
| Parameter | Value |
|---|---|
| spatial resolution | 30 arc second (~1km) |
| temporal resolution | year |
| time frame | 2010 - 2020 |
| unit of measure | estimated count of people per grid-cell |
Suggested citation:
- WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00647
It is possible to find additional information from:
The data can be freely download from:
- here.
4.1.2.4 Nighttime light
VIIRS nighttime lights (VNL) version V2.1: annual values obtained by from the monthly averages with filtering to remove extraneous features such as biomass burning, aurora, and background.
| Parameter | Value |
|---|---|
| spatial resolution | 15 arc second |
| temporal resolution | year |
| time frame | 2012 - 2021 |
| unit of measure | nW/cm2/sr, average-masked |
Suggested citation:
- Elvidge, C.D, Zhizhin, M., Ghosh T., Hsu FC, Taneja J. Annual time series of global VIIRS nighttime lights derived from monthly averages:2012 to 2019. Remote Sensing 2021, 13(5), p.922, doi:10.3390/rs13050922
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.2.5 Elevation
The Global Flood Awareness System (GloFAS) is one component of the Copernicus Emergency Management Service (CEMS). It is designed to support preparatory measures for flood events worldwide, particularly in large transnational river basins.
Elevation is obtained from the auxiliary variables of GloFAS. Each pixel is the mean height elevation above sea level.
| Parameter | Value |
|---|---|
| spatial resolution | 0.03° x 0.03° lon lat |
| temporal resolution | 30 years |
| time frame | 1981 - 2010 |
| unit of measure | Meter (m) |
Web resources:
Data access:
4.1.3 Survey
The Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA) is a unique system of longitudinal surveys designed to improve the understanding of household and individual welfare, livelihoods and smallholder agriculture in Africa. The LSMS team works with national statistics offices to design and implement household surveys with a strong focus on agriculture.
Suggested citation:
- Central Statistics Agency of Ethiopia. (2020). Socioeconomic Survey 2018-2019 [Data set]. World Bank, Development Data Group. https://doi.org/10.48529/K739-C548
It is possible to find additional information:
- here.
The data can be freely download from:
- here.
4.1.4 Administrative boundaries
The administrative divisions are obtained from GeoBoundaries[^2]. GeoBoundaries Built by the community and William & Mary geoLab, the geoBoundaries Global Database of Political Administrative Boundaries Database is an online, open license (CC BY 4.0) resource of information on administrative boundaries (i.e., state, county) for every country in the world. Since 2016, we have tracked approximately 1 million boundaries within over 200 entities, including all UN member states.
Suggested citation:
- Runfola D, Anderson A, Baier H, Crittenden M, Dowker E, Fuhrig S, et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866.
It is possible to find additional information:
- here.
The data can be freely download from:
- here.